Severe acute respiratory syndrome coronavirus 2 (SARS-CoV-2; 
initially named as 2019-nCoV) is responsible for the recent 
COVID-19 pandemic and polymerase chain reaction (PCR) is 
the current standard method for its diagnosis from patient 
samples. This study conducted a reassessment of published 
diagnostic PCR assays, including those recommended by the 
World Health Organization (WHO), through the evaluation 
of mismatches with publicly available viral sequences. An 
exhaustive evaluation of the sequence variability within the 
primer/probe target regions of the viral genome was performed 
using more than 17 000 viral sequences from around the world. 
The analysis showed the presence of mutations/mismatches in 
primer/probe binding regions of 7 assays out of 27 assays 
studied. A comprehensive bioinformatics approach for in silico 
inclusivity evaluation of PCR diagnostic assays of SARS-CoV-2 
was validated using freely available software programs that can 
be applied to any diagnostic assay of choice. These findings 
provide potentially important information for clinicians, 
laboratory professionals and policy-makers. 


1. Introduction 


On 31 December 2019, a cluster of 41 pneumonia cases of unknown 
aetiology in Wuhan, China, was reported to the World Health 
Organization (WHO). Subsequently, a novel coronavirus of 
zoonotic origin, severe acute respiratory syndrome coronavirus 2 
(SARS-CoV-2; initially named as 2019-nCoV), was isolated from 
the patients [1-3]. The virus has spread to more than 200 
countries and territories, resulting in the global coronavirus disease 
2019 (COVID-19) pandemic [4]. The rapid spread of the virus is 
partially attributed to the transmission by asymptomatic carriers 
or mildly symptomatic cases [5,6]. Early diagnostic testing is an 
important tool for policy-makers to make public health decisions 
to contain the outbreak. 


© 2020 The Authors. Published by the Royal Society under the terms of the Creative 
The virus from the patients was identified and sequenced early in the outbreak [1,7], which led to 
the development of several polymerase chain reaction (PCR) detection protocols by multiple national 
organizations that were published by the WHO [8]. In addition, several other methods have been 
developed and published in the literature recently [5,7,9-15]. However, the molecular diagnosis of 
SARS-CoV-2 may be jeopardized by potential preanalytical and analytical vulnerabilities including 
lack of harmonization of primers and probes [16]. Given the potential for the viruses to mutate, 
genetic variations in the viral genome at primer/probe binding regions can result in potential 
mismatches and false-negative results [17]. For example, primer and template mismatches have been 
reported to impede proper diagnosis of several viruses including influenza virus [18-21], respiratory 
syncytial virus [22], dengue virus [23], rabies virus [24], human immunodeficiency virus-1 [25,26] and 
hepatitis B virus [27,28]. 

SARS-CoV-2 is an enveloped positive-strand RNA virus classified as a member of family 
Coronaviridae in the genus Betacoronavirus along with SARS-CoV and Middle East respiratory 
syndrome (MERS)-CoV [29]. The sequence analysis of SARS-CoV-2 isolates has shown that its single- 
stranded RNA genome is approximately 30 kb in size [1,7,30]. Based on similarity with SARS-CoV, 
the SARS-CoV-2 genome has been predicted to encode at least 10 open reading frames (ORFs) for 
structural and accessory proteins. As per current annotation (NC_045512.2), these viral ORFs encode 
replicase ORF1ab, spike (S), envelope (E), membrane (M) and nucleocapsid (N), and at least six 
accessory proteins (3a, 6, 7a, 7b, 8 and 10) [31]. 

Human coronaviruses encode a proofreading exoribonuclease, nsp14-ExoN, for maintaining 
replication fidelity and thus have a relatively slower mutation rate than other RNA viruses [32,33]. 
SARS-CoV-2 encodes nsp14-ExoN as well [1], but mutations have been described in the genome for 
circulating SARS-CoV-2 [34-38]. Some laboratories have performed the alignment of diagnostic 
primers/probes with a limited number of viral sequences and have reported some mismatches [39,40] 
which may lead to false-negative results [41]. The use of several commercially developed diagnostic 
assays has also been permitted around the world with limited regulatory approval due to the 
pandemic emergency [42]. However, the limit of detection of these assays differs considerably and can 
also lead to false-negative results [43]. As there are already reports of false-negative diagnosis of 
COVID-19 [44-48], there is a need for verification of potential primer/probe mismatch with the 
sequences of viral isolates being isolated from around the world. The American Society for 
Microbiology COVID-19 International Summit held on 23 March 2020 recommended routine 
verification of sequence mutations in primer and probe binding regions of the viral genome for 
optimal virus detection [49]. 

The objective of this study is the in silico reassessment of previously published PCR primers/probes 
for COVID-19 diagnosis. This was performed through the evaluation of the sequence variability within 
the primer/probe target regions of SARS-CoV-2 viral isolates from around the world. The absence of any 
mutations and mismatches in target regions of the assay used would provide a higher degree of 
confidence in the test results obtained while the presence of mutations could help guide the strategies 
for the reassessment of diagnostic assays. We believe that these findings provide potentially important 
information for clinicians, laboratory professionals and policy-makers. 


2. Methods 


This study was pre-registered on the Open Science Framework (OSF); the accepted Stage 1 registration 
can be viewed at (https://osf.io/ym8gc). Minor deviations from protocol are identified in footnotes. 
The study design planner is included in table 1. The summary of the sequence tracing pipeline is 
shown in figure 1. 


2.1. Selection of primers and probes 


A total of 27 PCR primer-probe sets were selected based on literature review [9,10,12-15,50-52] and on 
the assays posted on WHO website [8] originally developed by seven different national institutions 
including Chinese Center for Disease Control and Prevention (China CDC), China; Institut Pasteur, 
Paris, France; US Centers for Disease Control and Prevention (CDC), USA; National Institute of 
Infectious Diseases, Japan; Charité - Universitätsmedizin Berlin Institute of Virology, Germany; 
The University of Hong Kong, Hong Kong; and National Institute of Health, Thailand. 

Sequences download
* Viral sequence data downloaded from the EpiCoV database (https://www.gisaid.org/)
* Complete genome of Wuhan-Hu-1 downloaded, NCBI Reference Sequence: NC_045512.2 (https://www.ncbi.nlm.nih.gov/nucleotide/)

Multiple sequence alignment (MSA) using the MAFFT program dedicated to viral genomes (https://mafft.cbrc.jp/alignment/server/)
* The complete genome of Wuhan-Hu-1 (NC_045512.2) added in the "Existing alignment" box and the other sequences in the "Fragmentary sequence(s)" box
* Parameters: UPPERCASE / lowercase, same as input; Direction of nucleotide sequences, same as input*; Output order, Aligned; and "Keep alignment length", Yes†. The aligned sequences downloaded in PIR format

Primer alignment with MSA file
* Primers/probes written in FASTA format
* Adjustment of the direction as necessary using the online Sequence Manipulation Suite (https://www.bioinformatics.org/sms2/rev_comp.html)
* The primers/probes aligned with the MSA using the 'add and align sequences from Clipboard' option of the AliView program (https://ormbunkar.se/aliview/)

Alignment inspection and sequence trimming using the AliView program (https://ormbunkar.se/aliview/)
* Inspection of the primer/probe binding regions, referred to here as region of interest (ROI)
* ROI for each primer/probe saved as a separate file in FASTA format

Sequence stratification using SequenceTracer (http://entropy.szu.cz:8080/EntropyCalcWeb/sequences)
* Exclusion of the sequences with stretches of NNNs, ambiguous sequences, and missing sequence in ROI†
* Download of the list of sequence variants along with percentage frequencies
* The data of the sequence variant groups with high frequency (if any) copied to the "Notes" and exported as a FASTA file

Position Nucleotide Numerical Summary (PNNS)
* Nucleotide composition of each position of highly variable regions (if any) using the PNNS calculator (http://entropy.szu.cz:8080/EntropyCalcWeb/pnns)


Figure 1. Sequence tracing pipeline used in the study. *The direction can be adjusted by selecting the option 'Adjust direction 
according to the first sequence', if needed. †The change was made with editorial approval after Stage 1. 


2.2. Sequencing data 


The complete genome sequences of the virus were downloaded from the Global Initiative on Sharing All 
Influenza Data (GISAID) EpiCoV database [53]. As of 7 May 2020, it hosted a total of 17 175 SARS-CoV-2 
sequences isolated from humans. By applying the complete genome (greater than 29 000 bp) filter, a total 
of 17 026 sequences were included in the study that are available upon free registration (https://www. 
gisaid.org/). SARS-CoV-2 is an RNA virus, but the data are shown in DNA format as per scientific 
convention. The sequences are shared by the laboratories around the world and a list of accession 
numbers is included in electronic supplementary material, file S1. It is recognized that this study is 
not immune to the geographical bias present in academic and scientific research. As the data were 
sampled from a global sequence database, it is possible that data may originate from high-income 
countries like the literature in other disciplines [54,55]. In addition, it is possible that data from certain 
countries or regions are excluded based on the exclusion criteria of low-quality data that may skew 
the data geographically. Another reason for possible data skew may be the origin of the current 
pandemic being China. Indeed, a recent study analysed the publications in COVID-19 literature hub 
LitCovid [56] and observed that more than 30% of articles were related to China [57]. These aspects of 
possible bias and data skew are addressed in the Discussion to make sure that the valid conclusions 
are drawn from the data in terms of geographical correlation. 


2.3. Multiple sequence alignment and alignment processing 


Multiple sequence alignment (MSA) was performed using MAFFT (Multiple Alignment with Fast 
Fourier Transform) program v. 7 dedicated to closely related viral genomes [58,59] available online 
(https://mafft.cbrc.jp/alignment/server/). The complete genome of Wuhan-Hu-1 downloaded from 
NCBI on 7 May 2020 was included as a reference, which is 29 903 bp long (NCBI Reference Sequence: 
NC_045512.2). The aligned sequences were downloaded in PIR format. Each primer/probe was 
aligned with the MSA and the binding region, referred to here as the region of interest (ROI), was 
inspected using the AliView program 1.26 [60]. To evaluate the sequence variability in target regions 
of previously published primers/probes, the ROI for each primer/probe set was saved as a separate 
file in FASTA format. 
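
The trimming step was carried out interactively in AliView. As a rough illustration only, the same idea can be sketched in Python: locate the oligo on the reference row of the alignment and cut the same columns from every sequence. The function names, the toy flanking bases and the isolate name are hypothetical, and the sketch assumes a gap-free reference row (as produced when the alignment length is kept) and an oligo without degenerate bases.

```python
from __future__ import annotations

# Complement table covering the IUPAC nucleotide codes.
_COMPLEMENT = str.maketrans("ACGTRYSWKMBDHVN", "TGCAYRSWMKVHDBN")


def reverse_complement(seq: str) -> str:
    """Return the reverse complement of a DNA sequence (IUPAC-aware)."""
    return seq.upper().translate(_COMPLEMENT)[::-1]


def find_roi(reference: str, oligo: str, is_reverse: bool = False) -> tuple[int, int]:
    """Locate an oligo on a gap-free reference row.

    Returns (start, end) as 0-based, end-exclusive column indices.
    Exact string search only: degenerate oligos will not be found.
    """
    site = reverse_complement(oligo) if is_reverse else oligo.upper()
    start = reference.upper().find(site)
    if start < 0:
        raise ValueError("oligo not found in reference")
    return start, start + len(site)


def trim_alignment(alignment: dict[str, str], start: int, end: int) -> dict[str, str]:
    """Cut the same column range (the ROI) out of every aligned sequence."""
    return {name: seq[start:end] for name, seq in alignment.items()}


# Toy alignment: hypothetical flanks around the CN-CDC-N forward site of figure 2a.
FWD = "GGGGAACTTCTCCTGCTAGAAT"
alignment = {
    "reference": "TTTT" + FWD + "CCCC",
    "isolate_1": "TTTT" + "AAC" + FWD[3:] + "CCCC",  # hypothetical isolate
}
start, end = find_roi(alignment["reference"], FWD)
roi = trim_alignment(alignment, start, end)
```

For a reverse primer, `is_reverse=True` searches for the reverse complement of the oligo, which mirrors the direction adjustment step of figure 1.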
2.4. Sequence variation in primer/probe binding regions in SARS-CoV-2 genome 


The MSA sequences for the forward primer, probe and reverse primer were stratified using the 
SequenceTracer module (http://entropy.szu.cz:8080/EntropyCalcWeb/sequences) of the Alignment 
Explorer [61]. This tool segregated sequences into discrete groups of identical sequence variants along 
with their frequency for each primer/probe. The sequences with stretches of NNNs, ambiguous 
sequences in the ROI and missing sequences¹ were excluded from the study. Subsequently, a threshold² 
(0.5% of all sequences included) was defined to remove extremely low-prevalence variants and 
sequencing errors in the data as described previously [61]. Thus, only the sequence variants with at 
least 0.5% incidence were further considered. The viral isolates were reported as the frequency of hits 
with perfect primer match and hits with mismatches along with a summary of mutated nucleotides 
for each primer/probe. The distribution of the sequence variants in the three primers/probes with the 
highest frequency of mismatches was analysed geographically. As the sequence variation was 
moderate, the base composition of each nucleotide position was not analysed. As noted in the 
registered Stage 1 protocol (https://osf.io/ym8gc), this analysis can be performed using the positional 
nucleotide numerical summary (PNNS) calculator (http://entropy.szu.cz:8080/EntropyCalcWeb/pnns) 
of the Alignment Explorer [61]. 
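
The stratification itself was done with SequenceTracer. The following is only a minimal sketch of the grouping idea, not that tool's implementation. The removal rules are simplifying assumptions: an ROI consisting only of gaps is treated as missing ('excluded'), an ROI containing any gap as short ('outgroup2') and an ROI containing any character other than A, C, G or T as ambiguous ('outgroup1').

```python
from collections import Counter


def stratify(roi_sequences):
    """Group ROI sequences into identical variants and tally removed ones.

    The three removal rules are assumptions made for this sketch.
    """
    variants = Counter()
    removed = Counter()
    for seq in roi_sequences:
        seq = seq.upper()
        if not seq.strip("-"):
            removed["excluded"] += 1    # nothing aligned in the ROI
        elif "-" in seq:
            removed["outgroup2"] += 1   # short / incomplete ROI
        elif set(seq) - set("ACGT"):
            removed["outgroup1"] += 1   # ambiguous bases such as N
        else:
            variants[seq] += 1
    return variants, removed


def frequencies(variants, total):
    """List (variant, count, percentage of total) in descending count order."""
    return [(seq, n, 100.0 * n / total) for seq, n in variants.most_common()]


# Toy ROI set (made-up 4-nt sequences, for illustration only).
toy = ["ACGT"] * 6 + ["ACGA"] * 2 + ["ACNT", "AC--", "----"]
toy_variants, toy_removed = stratify(toy)
toy_table = frequencies(toy_variants, len(toy))
```

The number of informative sequences is then the total minus the three removed groups, here 11 - 3 = 8.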


3. Results 


The sequence tracing pipeline (figure 1) was applied to the comprehensive sequence dataset of 17 027 
SARS-CoV-2 sequences for each PCR primer/probe. To determine the sequence variability in the 
primer/probe binding regions, all the sequences in the dataset were aligned using MAFFT. Next, for 
each PCR assay, the MSA file was trimmed to include only the primer or probe binding regions 
referred to here as ROI. The sequence file for each primer/probe was submitted to SequenceTracer, which 
segregated the sequences into discrete groups of identical sequence variants and presented a detailed view of the 
nucleotide variation in each ROI along with the frequency of each variant (figures 2 and 3; electronic 
supplementary material, file S2). All the sequences showing ambiguous sequences were grouped as 
'outgroup1', short sequences were grouped as 'outgroup2' and missing sequences were grouped as 
'excluded'. These three groups were not included in the analysis (collectively referred to here as 
'removed'), and the number of 'informative' sequences was calculated by subtracting these three 
groups from the total number of sequences. The informative group was then divided into hits with a 
perfect match and hits with mismatches for each primer and probe (table 2). It is not surprising that 
most primer/probe binding regions show mutations/mismatches with at least a couple of sequences 
but some of those may be extremely low prevalent variants and sequencing errors in the data. To 
minimize the effect of such sequences on the analysis, a threshold of 0.5% was then defined where 
only the sequence variants with at least 0.5% incidence were further considered as described 
previously [61]. The frequency of the sequences with the perfect match and with mismatches was then 
calculated from sequences above the threshold for each primer and probe. The summary of the 
analysis for 27 assays is presented in table 2. 
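
As a worked illustration of the threshold step, the sketch below uses the three most frequent CN-CDC-N forward primer variant counts as read from figure 2a (13 533, 3129 and 85 out of 17 026 sequences). The function names and the label `variant_3` are hypothetical. Applying the 0.5% threshold against all sequences is an interpretation that agrees with the 0.499% shown for a count of 85.

```python
def apply_threshold(variant_counts, total_sequences, threshold_pct=0.5):
    """Keep variants whose share of all sequences is at least threshold_pct."""
    return {
        variant: count
        for variant, count in variant_counts.items()
        if 100.0 * count / total_sequences >= threshold_pct
    }


def mismatch_percentage(kept_counts, perfect_variant):
    """Share of above-threshold sequences that differ from the oligo."""
    kept_total = sum(kept_counts.values())
    mismatched = kept_total - kept_counts.get(perfect_variant, 0)
    return 100.0 * mismatched / kept_total


PERFECT = "GGGGAACTTCTCCTGCTAGAAT"      # CN-CDC-N forward site (figure 2a)
counts = {
    PERFECT: 13533,
    "AACGAACTTCTCCTGCTAGAAT": 3129,     # GGG -> AAC variant (figure 2a)
    "variant_3": 85,                    # placeholder label; 85/17026 = 0.499%
}
kept = apply_threshold(counts, total_sequences=17026)
share = mismatch_percentage(kept, PERFECT)
```

Under these assumptions the third variant falls just below the threshold, and the mismatch share is 3129 / (13 533 + 3129), which is about 18.8%.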

It was observed that the primers/probe of 20 assays out of 27 assays tested showed a perfect match 
with the template at the defined threshold (table 2). It was further observed that the forward primer of 
CN-CDC-N showed three nucleotide mismatches with 18.8% of viral sequences (table 3 and figure 2a). In 
addition, the US-CDC-N-1 probe and the US-CDC-N-3 forward primer showed one mismatch with 1.6% 
and 1.2% viral sequences, respectively (table 3 and figure 3). The reverse primer of NIID-JP-N also 
showed one mismatch with all the sequences (table 3; electronic supplementary material, file S2). The 
probe of Chan-ORF1ab showed one mismatch with 0.9% of sequences, while its 
reverse primer showed one mismatch with all the sequences (table 3; electronic supplementary material, file S2). One mismatch 
was also observed with all the sequences for the probe of Young-N (table 3; electronic supplementary 
material, file S2). Most of the mismatches observed were not near the 3’ end of primers but some 
were in the probe binding regions. Many diagnostic assays have included degenerate nucleotides to 
increase the inclusivity of the assay for SARS-CoV and bat-SARS-related CoVs, but in certain cases, 
this is even detrimental for inclusive detection of SARS-CoV-2. For example, the Charité-ORF1b 


¹SequenceTracer removes the missing sequences in ROI. The exclusion criterion of missing sequences was clarified with editorial 
approval after Stage 1 acceptance and prior to observation of the data. 


²The threshold was decided before Stage 1 acceptance. However, it was not clearly mentioned in the Stage 1 protocol and only a previous 
study was referenced. 

[Figure 2 image: sequence variant tables. For each of the forward primer, probe and reverse primer, the columns are group number, variant count, frequency (%) and the aligned variant sequence.
(a) CN-CDC-N. Forward: GGGGAACTTCTCCTGCTAGAAT (variant 1, 13 533 sequences; variant 2, aac at the first three positions, 3129 sequences, 18.377%); probe: TTGCTGCTGCTTGACAGATT; reverse: CAGCTTGAGAGCAAAATGTCTG.
(b) Charité-ORF1b. Forward: GTGARATGGTCATGTGTGGCGG (R = A/G); probe: CAGGTGGAACCTCATCAGGAGATGC; reverse: TATGCTAATAGTGTSTTTAACATYTG (S = G/C, Y = C/T).]


Figure 2. Sequence variants in primers and probe binding regions for CN-CDC-N (a) and Charité-ORF1b (b): sequence variants in 17 
026 viral genome sequences aligned to the primer/probe binding regions (5′ → 3′) along with the number of sequence variants 
and the frequency of each variant in descending order. The dots indicate an identical nucleotide. The horizontal double bar indicates 
the threshold (greater than or equal to 0.5%). The binding region of reverse primer is reverse complemented. As an example, the 
removed and informative sequences are indicated with vertical bars. outgroup1, ambiguous sequences; outgroup2, short sequences. 


reverse primer contains an S (G or C) but all the viral sequences (in total 17 002) contain a T at this 
position (table 3 and figure 2b). Some of the other mutations observed in the primer/probe binding 
regions that did not pass the defined threshold include T13402G, C15540T, A28338G, C28846T, 
C28887T, C28896G, C29144T, T29148C and A29188T. Some of these are near the 3’ end of primers 
(figures 2 and 3; electronic supplementary material, file S2). 
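
The degenerate-base case described above can be checked mechanically by treating each IUPAC code as the set of bases it allows. The sketch below is illustrative only: the function name is hypothetical, the binding region is the Charité-ORF1b reverse site as displayed in figure 2b, and the template is constructed from the statement that the viral sequences carry a T at the S position (the T at the Y position is an assumption made so that only one mismatch remains). It is not copied from a genome record.

```python
# IUPAC nucleotide codes mapped to the bases each one permits.
IUPAC = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "R": "AG", "Y": "CT", "S": "CG", "W": "AT", "K": "GT", "M": "AC",
    "B": "CGT", "D": "AGT", "H": "ACT", "V": "ACG", "N": "ACGT",
}


def mismatch_positions(oligo_site, template):
    """Return 0-based positions where the template base is not allowed by the oligo."""
    if len(oligo_site) != len(template):
        raise ValueError("oligo site and template must have equal length")
    return [
        i
        for i, (code, base) in enumerate(zip(oligo_site.upper(), template.upper()))
        if base not in IUPAC[code]
    ]


# Binding region as displayed in figure 2b (genome sense orientation).
REGION = "TATGCTAATAGTGTSTTTAACATYTG"
# Illustrative template: T at the S position (per the text), T at Y (assumed).
TEMPLATE = REGION.replace("S", "T").replace("Y", "T")

positions = mismatch_positions(REGION, TEMPLATE)
```

Here the Y position matches because T is one of the bases Y allows, whereas the S position is reported as a mismatch because S allows only G or C.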

The majority of the sequences included in this study originated from Europe (9410) and North 
America (4759), while there were only 136 sequences from Africa, 7 from Central America and 142 
from South America. The UK and the USA were among the countries with the highest number of 
sequences included (figure 4a; electronic supplementary material, file S3). The geographical 
distribution of the CN-CDC-N forward primer, US-CDC-N-1 probe and US-CDC-N-3 forward primer 
mismatches showed that they are distributed globally. However, mismatches with the CN-CDC-N forward 
primer were mostly found in Europe, while mismatches with the US-CDC-N-1 probe and the US- 
CDC-N-3 forward primer were found mostly in Australia and Asia (figure 4; electronic 
supplementary material, file S3). 


4. Discussion 


This study exhaustively evaluated the genetic diversity in the primer/probe binding regions of 27 
previously published SARS-CoV-2 diagnostic assays including those recommended by WHO. The 
data presented in this study show mismatches in seven assays, highlighting the need for keeping the 
assay current through regular verification of sequence variation in PCR primer/probe binding regions. 
The other 20 assays show a perfect match with 100% of sequences at the defined threshold of 0.5%. 
This observation is in line with the estimates of the moderate mutation rate in the SARS-CoV-2 
genome similar to the SARS-CoV genome [63,64]. It has been estimated that the mutation rate in the 
genome of coronaviruses is less than other RNA viruses while much higher than DNA viruses and 
the host [65,66]. Although all the sequences with mismatches were grouped in comparison to 
sequences with a perfect match, not all mismatches necessarily result in false-negative results. The 

[Figure 3 image: sequence variant tables. For each of the forward primer, probe and reverse primer, the columns are group number, variant count, frequency (%) and the aligned variant sequence.
(a) US-CDC-N-1. Forward: GACCCCAAAATCAGCGAAAT; probe: ACCCCGCATTACGTTTGGTGGACC (variant 2, 273 sequences, 1.603%); reverse: CAGATTCAACTGGCAGTAACCAGA.
(b) US-CDC-N-3. Forward: GGGAGCCTTGAATACACCAAAA (variant 1, 16 747 sequences; variant 2, 196 sequences); probe: AYCACATTGGCACCCGCAATCCTG (Y = C/T); reverse: CAATGCTGCAATCGTGCTACA.]

Figure 3. Sequence variants in primers and probe binding regions for US-CDC-N-1 (a) and US-CDC-N-3 (b): sequence variants in 17 
026 viral genome sequences aligned to the primer/probe binding regions (5′ → 3′) along with the number of sequence variants and 
the frequency of each variant in descending order. The dots indicate an identical nucleotide. The horizontal double bar indicates the 
threshold (greater than or equal to 0.5%). The binding region of reverse primer is reverse complemented. outgroup1, ambiguous 
sequences; outgroup2, short sequences; excluded, missing sequences. 


effects of mismatch between primers/probes and template depend upon position and number of 
mismatches. Most of the mismatches observed in primers of SARS-CoV-2 diagnostic assays were not 
near the 3’ end and may be tolerated. Mismatches at the 3’ end are known for their deleterious effect 
on PCR amplification [17,67,68], but single mismatches, especially those more than 5 bp from the 3’ end, 
have a moderate effect on PCR amplification and are unlikely to significantly affect the assay 
performance [67]. Three assays showed a single nucleotide mismatch in the probe binding region. 
Probe-based detection is more sensitive to mismatches in the probe region, and even a single mismatch may 
reduce the sensitivity of the assay and lead to false-negative results by preventing probe 
binding and subsequent fluorescence [22,28,69-71]. Even in scenarios where mismatches were 
tolerated, one additional mutation resulted in reduced RT-qPCR sensitivity for the detection of 
influenza A virus [18]. 
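
The positional reasoning above can be made explicit with a small helper that converts a mismatch position in the displayed binding region into a distance from the primer's 3' end. This is a sketch under stated assumptions: the function names are hypothetical, the 5-nt window simply mirrors the "more than 5 bp" wording above and is not a validated cut-off, and binding regions are assumed to be written 5' to 3' on the genome sense strand (so a reverse primer's 3' end pairs with the left end of the displayed region).

```python
def distances_from_3prime(mismatch_indices, site_length, reverse_primer=False):
    """Distance of each mismatch from the primer's 3' terminal base (0 = terminal).

    mismatch_indices are 0-based positions in the binding region written
    5'->3' on the genome sense strand.
    """
    if reverse_primer:
        # The reverse primer anneals to the sense strand antiparallel,
        # so its 3' end sits at the left end of the displayed region.
        return list(mismatch_indices)
    return [site_length - 1 - i for i in mismatch_indices]


def near_3prime(mismatch_indices, site_length, reverse_primer=False, window=5):
    """True if any mismatch lies within `window` bases of the 3' end (assumed cut-off)."""
    return any(
        d < window
        for d in distances_from_3prime(mismatch_indices, site_length, reverse_primer)
    )


# CN-CDC-N forward primer (22 nt): variant differs at the first three positions.
cn_cdc_forward = distances_from_3prime([0, 1, 2], 22)
# Charité-ORF1b reverse site (26 nt): S position at index 14 of the displayed region.
charite_reverse = distances_from_3prime([14], 26, reverse_primer=True)
```

With these inputs neither example falls within the assumed 3' window, consistent with the observation that most mismatches were not near the 3’ end.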

Despite the ability of single mismatches to be tolerated, it is important to consider that mismatches 
need to be corrected if found in most of the viral sequences available. For example, the reverse primer 
of Charité-ORF1b shows a mismatch with all the viral sequences (a total of 17 002). This mismatch has 
also been observed in 990 viral sequences along with the lower sensitivity of this assay in a recent 
preprint [72]. Similarly, the NIID-JP-N reverse primer also shows a mismatch with all the sequences. 
This assay released by WHO was subsequently corrected by the authors in a separate study [51]. 
Although they show no difference in the performance of both assays, there is no apparent reason for 
not correcting the mismatch in the primer. The WHO recommended assays of SARS-CoV-2 were 
developed by multiple national organizations early in the outbreak with limited genomic sequence 
data available and have been instrumental for the diagnosis of COVID-19. However, some of the 
assays have not been reassessed in the light of the risk of mutations during viral evolution. Based on 
the analysis of 17 027 viral sequences, this study demonstrates the presence of mutations/mismatches 
in the primer/probe binding regions of some published assays (table 3). Sequence adjustments to 
these primers/probes need to be assessed experimentally using viral strains or nucleic acid coupled 
with subsequent experimental performance using clinical samples. With increasing concern of false- 
negative COVID-19 diagnosis and poor sensitivity of diagnostic PCR in certain cases [73,74], 

[Figure 4 image: pie charts by continent (Africa, Asia, Central America, Europe, North America, Oceania, South America). (a) Viral sequences included in dataset (n = 17 026); (b) CN-CDC-N F primer mismatches (n = 3129); (c) US-CDC-N-1 probe mismatches (n = 273); (d) US-CDC-N-3 F primer mismatches (n = 196).]

Figure 4. Geographical distribution of included sequences dataset (a) and mismatches for CN-CDC-N forward primer (b), US-CDC-N-1 
probe (c) and US-CDC-N-3 forward primer (d). The total number of sequences in each dataset is given in parentheses. Data used to 
draw graphs are included in electronic supplementary material, file S3. 


correcting the mismatches between primers/probes and template may help to improve the sensitivity of 
certain diagnostic assays. 

There have been recent efforts along the same line where a limited number of viral sequences were 
aligned with primers/probes to search for mismatches. One of the recent preprints used 992 sequences 
to report some variants in the primer/probe binding regions [72]. However, many of the mismatches 
could be rare variants or sequencing errors, and variability in the assay binding regions should be 
assessed across a larger number of viral sequences. In addition, the diagnostic assay should not be 
revised based on the presence of rare variants in the population and thus a threshold of 0.5% was 
defined to eliminate such variants from the analysis. Some of the mismatches observed by this preprint 
were confirmed in the larger dataset of the current study. Other variants were not observed or did not 
reach the threshold and thus were not reported in the final analysis. It cannot be excluded that 
empirical threshold adjustment of this study might have missed some significant variants. For instance, 
choosing a threshold of 0.2% would have resulted in a mismatch with five additional assays that were 
reported to match with 100% of sequences in the current analysis. Another recent preprint reported a 
bioinformatics system named 'BioLaboro' to assess the efficacy of the existing PCR assays to detect 
pathogens as they evolve [75]. However, this system requires specialized software and large RAM 
hardware which is not generally available in regular diagnostic or research laboratories. By contrast, the 
current study validates a pipeline for in silico re-evaluation of PCR diagnostic assays of SARS-CoV-2. 
This approach has successfully been applied previously for influenza A virus [61]. Using freely available 
open-source software, the analysis was performed on a regular desktop computer without any need for 
special hardware. The pipeline does not require extensive computational skills except for some sequence 
alignment skills. The pipeline can be applied to a SARS-CoV-2 diagnostic assay of choice. 

Verification of in silico nucleotide identity match, termed as inclusivity analysis, is also a component 
of the performance criteria of COVID-19 diagnostic assays by the U.S. Food and Drug Administration 
(FDA) as well as the European Commission [76,77]. Several commercially developed COVID-19 
diagnostic assays have received limited regulatory approval due to the emergency situation. As of 12 
May 2020, a total of 54 commercial diagnostic test kits including the one developed by the US-CDC have 
received emergency use authorization (EUA) from the FDA [78]. The CDC has also reported one 
nucleotide mismatch in the N1 forward primer in their inclusivity assay using sequences available as 
of 1 February 2020 [62]. Some commercial kits like BD BioGX use CDC primers and thus do not 
conduct independent inclusivity analysis [79]. Many other kits have reported the alignment of their 
assay primers/probes with a couple of hundred sequences [80-85]. As primer/probe identity for most 
commercial kits is not revealed, manufacturer-independent data are scarce. Recent comparisons of 
SARS-CoV-2 diagnostic assays have shown some discordance which may partially be due to sequence 
differences [86,87]. Therefore, there is a need for comprehensive inclusivity assessment of commercial 
diagnostic assays. Although not addressed in this article, other factors for reassessment include 
in silico cross-reactivity with human genes, genes of other members of family Coronaviridae and other 
respiratory viruses / bacteria. 
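
The inclusivity check described above can be illustrated with a minimal sketch. It is not the pipeline used in this study; it assumes the viral sequences are already aligned to a common coordinate system, that the primer is given in the same orientation as the aligned genomes (a reverse primer would first need to be reverse-complemented), and that degenerate bases are not used. The function names and the `start` parameter are hypothetical.

```python
from collections import Counter
from typing import Iterable, Tuple


def count_mismatches(primer: str, target: str) -> int:
    """Count positions at which the target region differs from the primer."""
    if len(primer) != len(target):
        raise ValueError("primer and target region must have equal length")
    return sum(1 for p, t in zip(primer.upper(), target.upper()) if p != t)


def inclusivity(primer: str, aligned_seqs: Iterable[str], start: int) -> Tuple[float, Counter]:
    """Return the fraction of sequences matching the primer perfectly,
    plus a tally of how many sequences carry 0, 1, 2, ... mismatches.

    `start` is the 0-based alignment coordinate of the primer binding site
    (an assumption of this sketch: all sequences share one coordinate system).
    """
    end = start + len(primer)
    tally = Counter()
    for seq in aligned_seqs:
        tally[count_mismatches(primer, seq[start:end])] += 1
    total = sum(tally.values())
    perfect = tally[0] / total if total else 0.0
    return perfect, tally


# Toy data, invented purely for illustration.
perfect, tally = inclusivity("ACGT", ["TTACGTAA", "TTACGAAA", "TTACGTCC"], start=2)
print(f"perfect match: {perfect:.1%}; mismatch tally: {dict(tally)}")
```

In this toy example, two of the three sequences match the primer exactly and one carries a single mismatch at the 3' end of the binding site. A real evaluation would additionally need to handle ambiguous bases and gaps before counting, as discussed in the limitations below.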

The methodology outlined here uses MSA of publicly available viral sequences and is prone to certain 
biases despite its general utility in diagnostic PCR assay design. One of the biases is the compositional 
bias, which may arise as a result of sampling from certain geographical locations due to access to better 
facilities for viral genome sequencing or location of the outbreak. Based on a relatively moderate 
mutation rate in the genome, the results obtained can be applied globally, but caution should be 
exercised when drawing conclusions from the results for a specific region, especially with a smaller 
number of sequences included. Another possible geographical bias can arise due to the removal of 
data collected from certain countries or regions. However, the fact that less than 2.1% of sequences 
were removed for 73 out of 76 primers/probes studied mitigates this concern in the current study. 
The geographical analysis of the removed data (approx. 6%) of the remaining three primers/probes 
showed that most of the removed viral sequences were from Europe as expected (electronic 
supplementary material, file S4). Although the risk of data skew geographically cannot be ruled out 
completely, this much data exclusion is in line with previous reports [61]. Another source of 
compositional bias may be the redundancy where the same viral strain is re-sequenced and 
re-submitted to the sequence database. 
Another source of bias may arise from the submission of isolates after passaging in cell culture, as 
well as from sequencing artefacts including ambiguous data, short artificial insertions or deletions, incorrect 
sequence directions, incorrect nucleotide insertions, short sequence stretches and sequences longer than 
the standard length [88]. Most records in the EpiCov database are full-length sequences, and thus short 
sequences were not included in the study. To remove artificially inserted sequences and sequences at 
the ends, if any, MSA was performed with the option to keep the alignment length according to the 
reference sequence. In this methodology, no gaps are inserted in the reference sequence and 
corresponding sites in the other sequences are deleted. Therefore, this methodology can potentially 
remove any real insertions as well. However, only seven insertions affecting 31 sequences are 
catalogued in the CoV-GLUE database (http://cov-glue.cvr.gla.ac.uk/#/insertion) as of 22 May 2020 [89]. 
The use of SequenceTracer in the tracing pipeline successfully filters out ambiguous data and 
deletions [61]. As SequenceTracer removes all the sequences with short and missing sequences, a real 
deletion of a stretch of sequence would also be filtered out. However, only a few sequences were 
removed in the ‘outgroup2’ or in ‘excluded’ group (figures 2 and 3; electronic supplementary 
material, file S2). In line with this, none of the deletions affecting more than two sequences listed in the CoV-GLUE 
database (http://cov-glue.cvr.gla.ac.uk/#/deletion) as of 22 May 2020 were found in the ROI under 
study. 
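
The two filtering behaviours described above (reference-anchored trimming of the alignment, and exclusion of sequences with ambiguous or missing bases in the region of interest) can be sketched as follows. This is an illustrative approximation only, not the alignment software or SequenceTracer itself; the function names are hypothetical and the input is assumed to be an existing alignment in which gaps are written as '-'.

```python
from typing import Dict


def trim_to_reference(ref_aligned: str, others_aligned: Dict[str, str]) -> Dict[str, str]:
    """Drop every alignment column in which the reference has a gap.

    The result has exactly the ungapped length of the reference, so any
    insertion relative to the reference (artificial or real) is removed.
    """
    keep = [i for i, base in enumerate(ref_aligned) if base != "-"]
    return {name: "".join(seq[i] for i in keep) for name, seq in others_aligned.items()}


def has_ambiguity_or_gap(region: str) -> bool:
    """True if the region contains anything other than A, C, G or T,
    i.e. an ambiguity code such as N or a gap left by a deletion."""
    return any(base not in "ACGT" for base in region.upper())


# Toy alignment, invented for illustration: s1 carries an insertion,
# s3 carries a deletion relative to the reference.
ref = "AC-GT"
others = {"s1": "ACTGT", "s2": "AC-GT", "s3": "A--GT"}

trimmed = trim_to_reference(ref, others)
retained = {name: seq for name, seq in trimmed.items() if not has_ambiguity_or_gap(seq)}
print(trimmed)
print(sorted(retained))
```

In the toy data, the insertion in s1 disappears after trimming, so s1 becomes indistinguishable from the reference, while s3 is excluded because its deletion leaves a gap in the region. This mirrors the caveat stated in the text: real insertions and real deletions are both lost by this kind of filtering.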


5. Conclusion 


This work outlines a comprehensive approach for the bioinformatics reassessment of PCR diagnostic 
assays for SARS-CoV-2. The application of this strategy to 27 previously developed assays using 
17027 viral sequences showed mutations/mismatches in primer/probe binding regions of seven 
assays. This information will act as a reference and may help re-evaluate COVID-19 diagnostic 
strategies. In silico analysis of primers/probes should be coupled with empirical testing on clinical 
samples and the primers/probes that work well in silico as well as empirically should be used in a 
diagnostic assay for SARS-CoV-2. 


Data accessibility. A list of accession numbers of sequences is included in electronic supplementary material, file S1. 
Sequence tracing figures of all the assays not shown in the main article are included in electronic supplementary 
material, file S2. Geographical data used to draw graphs in figure 4 are included in electronic supplementary 